A Florida health insurance company wants to predict annual claims for individual clients. The company pulls a random sample of 50 customers. The owner wishes to charge an actuarially fair premium to ensure a normal rate of return. The owner collects all of their current customers’ health care expenses from the last year and compares them with what is known about each customer’s plan.

The data on the 50 customers in the sample is as follows:

  • Charges: Total medical expenses for a particular insurance plan (in dollars)
  • Age: Age of the primary beneficiary
  • BMI: Primary beneficiary’s body mass index (kg/m²)
  • Female: Primary beneficiary’s birth sex (0 = Male, 1 = Female)
  • Children: Number of children covered by health insurance plan (includes other dependents as well)
  • Smoker: Indicator if primary beneficiary is a smoker (0 = non-smoker, 1 = smoker)
  • Cities: Dummy variables for each city with the default being Sanford

Answer the following questions using complete sentences and attach all output, plots, etc. within this report.

Question 1

Randomly select three observations from the sample and exclude from all modeling (i.e. n=47). Provide the summary statistics (min, max, std, mean, median) of the quantitative variables for the 47 observations.

Table: Summary of Quantitative Variables (except Children) for the 47 observations

Characteristic      N = 47
Charges
  Mean (SD)         12,317 (11,498)
  Median (IQR)      8,604 (4,480, 13,552)
  Range             2,494, 55,135
Age
  Mean (SD)         42 (13)
  Median (IQR)      43 (30, 53)
  Range             23, 64
BMI
  Mean (SD)         29.0 (5.6)
  Median (IQR)      28.5 (25.3, 32.4)
  Range             16.8, 42.1
Children Summary Data
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   1.000   1.234   2.000   5.000
Children Standard Deviation
## [1] 1.18345
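
A minimal sketch of how the hold-out split and the five summary statistics could be produced. The data frame here is synthetic (it is not the report's sample), the name `insurance` and the helper `five_stats` are hypothetical, and `insurance.new` / `insurance.test` simply mirror the names used later in this report.

```r
# Synthetic stand-in for the 50-customer sample (NOT the report's data)
set.seed(1)
insurance <- data.frame(
  Charges  = round(exp(rnorm(50, mean = 9, sd = 0.8))),
  Age      = sample(23:64, 50, replace = TRUE),
  BMI      = round(rnorm(50, mean = 29, sd = 5.6), 1),
  Children = sample(0:5, 50, replace = TRUE)
)

# Randomly withhold three rows; the remaining 47 are used for modeling
test_rows      <- sample(nrow(insurance), 3)
insurance.test <- insurance[test_rows, ]
insurance.new  <- insurance[-test_rows, ]

# min, max, sd, mean and median for one numeric column
five_stats <- function(x) {
  c(min = min(x), max = max(x), sd = sd(x), mean = mean(x), median = median(x))
}

# One column of statistics per quantitative variable
summary_table <- sapply(insurance.new, five_stats)
print(round(summary_table, 2))
```

Fixing the seed before sampling keeps the same three customers withheld every time the report is re-knit.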

Question 2

Provide the correlation between all quantitative variables
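
No correlation output is included in this report. As a sketch only, the correlation matrix of the quantitative variables could be obtained with `cor()`. The data below are synthetic placeholders, and the data frame name `quant` is hypothetical.

```r
# Synthetic placeholder for the 47 modeling observations (NOT the report's data)
set.seed(2)
quant <- data.frame(
  Age      = sample(23:64, 47, replace = TRUE),
  BMI      = round(rnorm(47, mean = 29, sd = 5.6), 1),
  Children = sample(0:5, 47, replace = TRUE)
)
quant$Charges <- round(exp(7 + 0.035 * quant$Age + rnorm(47, sd = 0.4)))

# Pearson correlations between every pair of quantitative variables
cor_matrix <- cor(quant)
print(round(cor_matrix, 3))
```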

Question 3

Run a regression that includes all independent variables in the data table. Does the model above violate any of the Gauss-Markov assumptions? If so, what are they and what is the solution for correcting?

Summary Regression Output of all independent variables (n=47)
## 
## Call:
## lm(formula = Charges ~ ., data = insurance.new)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -11888  -2726  -1065    711  20257 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -14022.39    6563.47  -2.136 0.039145 *  
## Age              287.26      77.04   3.729 0.000626 ***
## BMI              434.97     200.14   2.173 0.036058 *  
## Female           858.33    2120.59   0.405 0.687923    
## Children         118.17     873.64   0.135 0.893122    
## Smoker         23108.13    3009.97   7.677 3.04e-09 ***
## WinterSprings  -1659.04    3069.60  -0.540 0.592024    
## WinterPark     -4853.57    3009.55  -1.613 0.115080    
## Oviedo         -3769.38    2566.29  -1.469 0.150115    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6722 on 38 degrees of freedom
## Multiple R-squared:  0.7176, Adjusted R-squared:  0.6582 
## F-statistic: 12.07 on 8 and 38 DF,  p-value: 2.224e-08
Plot of Insurance Data and Scatterplot Matrix of all Quantitative Variables

3rd Assumption (Residuals vs. Fitted plot): nonlinearity is present, which indicates a functional-form problem. Solution: consider using ratios or percentages rather than raw data (see the module on multicollinearity for a complete discussion of the associated problems and causes).

6th Assumption (Normal Q-Q plot): the residuals are not normally distributed. Solution: look for subgroups in the data and analyze them separately, or use summary data (such as the mean value) rather than the raw data.

4th Assumption (Scale-Location plot): heteroskedasticity is occurring.
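
The three diagnostic plots named above come from `plot()` applied to a fitted `lm` object. The sketch below shows the calls on synthetic data (not the report's sample); the data frame `demo` is hypothetical, and the Shapiro-Wilk test is an optional numerical check added here, not something the report ran.

```r
# Synthetic data with a multiplicative error, so the raw-dollar model misbehaves
set.seed(3)
n <- 47
demo <- data.frame(
  Age    = sample(23:64, n, replace = TRUE),
  BMI    = round(runif(n, min = 17, max = 42), 1),
  Smoker = rbinom(n, size = 1, prob = 0.2)
)
demo$Charges <- exp(7 + 0.035 * demo$Age + 0.01 * demo$BMI +
                    1.3 * demo$Smoker + rnorm(n, sd = 0.4))

fit <- lm(Charges ~ ., data = demo)

if (!interactive()) pdf(NULL)   # draw to a null device when run as a script
plot(fit, which = 1)            # Residuals vs Fitted: checks linearity
plot(fit, which = 2)            # Normal Q-Q: checks normality of residuals
plot(fit, which = 3)            # Scale-Location: checks constant variance
if (!interactive()) invisible(dev.off())

# Optional numerical check of residual normality
sw <- shapiro.test(residuals(fit))
print(sw)
```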

Question 4

Implement the solutions from question 3, such as data transformation, along with any other changes you wish. Use the sample data and run a new regression. How have the fit measures changed? How have the signs and significance of the coefficients changed?

Scatterplot Matrices for the Log of Charges and the Insurance Data minus Dummy Variables

Summary Regression Model with Log of Charges
## 
## Call:
## lm(formula = LogCharges ~ ., data = insurance.new[, c(10, 2:9)])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.65510 -0.14862 -0.05322  0.03263  1.28444 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    7.033276   0.387771  18.138  < 2e-16 ***
## Age            0.034991   0.004552   7.688 2.94e-09 ***
## BMI            0.011547   0.011824   0.977    0.335    
## Female         0.054880   0.125285   0.438    0.664    
## Children       0.063550   0.051615   1.231    0.226    
## Smoker         1.324284   0.177829   7.447 6.16e-09 ***
## WinterSprings -0.007282   0.181353  -0.040    0.968    
## WinterPark    -0.051822   0.177804  -0.291    0.772    
## Oviedo        -0.144341   0.151617  -0.952    0.347    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3972 on 38 degrees of freedom
## Multiple R-squared:  0.7924, Adjusted R-squared:  0.7487 
## F-statistic: 18.13 on 8 and 38 DF,  p-value: 8.493e-11
Model: Age with a logarithmic Shape
## 
## Call:
## lm(formula = LogCharges ~ ., data = insurance_LogChrgAgeWDummy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.58853 -0.17786 -0.05451  0.02616  1.27653 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    3.17330    0.73015   4.346 9.98e-05 ***
## LogAge         1.42556    0.18545   7.687 2.95e-09 ***
## BMI            0.01451    0.01178   1.232    0.225    
## Female         0.06560    0.12535   0.523    0.604    
## Children       0.05664    0.05168   1.096    0.280    
## Smoker         1.32511    0.17782   7.452 6.07e-09 ***
## WinterSprings -0.02476    0.18155  -0.136    0.892    
## WinterPark    -0.07879    0.17815  -0.442    0.661    
## Oviedo        -0.14899    0.15168  -0.982    0.332    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3972 on 38 degrees of freedom
## Multiple R-squared:  0.7924, Adjusted R-squared:  0.7487 
## F-statistic: 18.13 on 8 and 38 DF,  p-value: 8.507e-11

Plots for Model results for Log of Charges and Log of Age

Scatterplot Matrix of Log of Charges/Log of Age compared to Log of Charges and Age Squared

Model: Age with a Quadratic Relationship
## 
## Call:
## lm(formula = LogCharges ~ ., data = insurance.new[, c(12, 2:10)])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.62995 -0.14987 -0.05370  0.02717  1.28495 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.7322920  0.9363606   7.190 1.59e-08 ***
## AgeSq         -0.0001643  0.0004640  -0.354    0.725    
## Age            0.0492269  0.0404749   1.216    0.232    
## BMI            0.0124770  0.0122478   1.019    0.315    
## Female         0.0605778  0.1277695   0.474    0.638    
## Children       0.0598072  0.0532787   1.123    0.269    
## Smoker         1.3245151  0.1799132   7.362 9.39e-09 ***
## WinterSprings -0.0149998  0.1847672  -0.081    0.936    
## WinterPark    -0.0626046  0.1824473  -0.343    0.733    
## Oviedo        -0.1470754  0.1535865  -0.958    0.344    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4018 on 37 degrees of freedom
## Multiple R-squared:  0.7931, Adjusted R-squared:  0.7428 
## F-statistic: 15.76 on 9 and 37 DF,  p-value: 3.566e-10

Summary Model for BMI with a Logarithmic shape
## 
## Call:
## lm(formula = LogCharges ~ ., data = insurance.new[, c(2, 4:10, 
##     13)])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.65610 -0.15185 -0.05397  0.02865  1.27595 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.313609   1.106364   5.707 1.44e-06 ***
## Age            0.034944   0.004564   7.656 3.25e-09 ***
## Female         0.056410   0.125504   0.449    0.656    
## Children       0.064999   0.051857   1.253    0.218    
## Smoker         1.323267   0.177896   7.438 6.32e-09 ***
## WinterSprings -0.005992   0.181873  -0.033    0.974    
## WinterPark    -0.045362   0.176489  -0.257    0.799    
## Oviedo        -0.140444   0.151103  -0.929    0.359    
## LogBMI         0.314013   0.330373   0.950    0.348    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3974 on 38 degrees of freedom
## Multiple R-squared:  0.7921, Adjusted R-squared:  0.7484 
## F-statistic:  18.1 on 8 and 38 DF,  p-value: 8.695e-11

Plot of Model for Log of Charges with Dummy Variables

Scatterplot Matrix of Log of Charges/Log of BMI compared to Log of Charges and BMI Squared

Model: BMI with a Quadratic Relationship
## 
## Call:
## lm(formula = LogCharges ~ ., data = insurance.new[, c(2:10, 14)])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.65467 -0.14654 -0.04853  0.03424  1.28639 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    7.111e+00  1.356e+00   5.245 6.61e-06 ***
## Age            3.502e-02  4.643e-03   7.543 5.43e-09 ***
## BMI            6.116e-03  9.201e-02   0.066    0.947    
## Female         5.393e-02  1.280e-01   0.422    0.676    
## Children       6.287e-02  5.354e-02   1.174    0.248    
## Smoker         1.324e+00  1.802e-01   7.349 9.77e-09 ***
## WinterSprings -8.169e-03  1.844e-01  -0.044    0.965    
## WinterPark    -5.396e-02  1.837e-01  -0.294    0.771    
## Oviedo        -1.452e-01  1.542e-01  -0.941    0.353    
## BMISq          9.296e-05  1.562e-03   0.060    0.953    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4025 on 37 degrees of freedom
## Multiple R-squared:  0.7924, Adjusted R-squared:  0.7419 
## F-statistic: 15.69 on 9 and 37 DF,  p-value: 3.779e-10

To address the Gauss-Markov assumptions that were violated, we fit and compared the following models:

  1. Log of Charges
  2. Log of Charges and Log of Age
  3. Log of Charges and Age Squared
  4. Log of Charges and Log of BMI
  5. Log of Charges and BMI squared
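
The five specifications listed above could be fit and compared side by side along these lines. This is a sketch on synthetic data with a reduced set of predictors; the object names `d`, `specs`, `fits` and `fit_table` are hypothetical and the numbers it prints are not the report's results.

```r
# Synthetic data (NOT the report's sample)
set.seed(4)
n <- 47
d <- data.frame(
  Age    = sample(23:64, n, replace = TRUE),
  BMI    = round(runif(n, min = 17, max = 42), 1),
  Smoker = rbinom(n, size = 1, prob = 0.2)
)
d$Charges <- exp(7 + 0.035 * d$Age + 0.01 * d$BMI +
                 1.3 * d$Smoker + rnorm(n, sd = 0.4))

# Transformed variables used by the candidate models
d$LogCharges <- log(d$Charges)
d$LogAge     <- log(d$Age)
d$AgeSq      <- d$Age^2
d$LogBMI     <- log(d$BMI)
d$BMISq      <- d$BMI^2

specs <- list(
  "1. Log Charges"          = LogCharges ~ Age + BMI + Smoker,
  "2. Log Charges, Log Age" = LogCharges ~ LogAge + BMI + Smoker,
  "3. Log Charges, Age Sq"  = LogCharges ~ Age + AgeSq + BMI + Smoker,
  "4. Log Charges, Log BMI" = LogCharges ~ Age + LogBMI + Smoker,
  "5. Log Charges, BMI Sq"  = LogCharges ~ Age + BMI + BMISq + Smoker
)
fits <- lapply(specs, function(f) lm(f, data = d))

# SEE (residual standard error), R-squared and adjusted R-squared per model
fit_table <- data.frame(
  SEE   = sapply(fits, function(m) summary(m)$sigma),
  R2    = sapply(fits, function(m) summary(m)$r.squared),
  AdjR2 = sapply(fits, function(m) summary(m)$adj.r.squared)
)
print(round(fit_table, 4))
```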

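Overall, the measures of fit improved for every regression. R-squared rose from 0.72 to about 0.79, and adjusted R-squared rose from 0.66 to about 0.74 to 0.75, for all five models. The SEE fell from 6722 to about 0.40, but it is now measured in log dollars rather than dollars, so most of that drop reflects the change of scale rather than a gain in accuracy.

Below are the changes in coefficient significance and sign relative to the original regression. None of the original slope coefficients changed sign in any model; only the intercept changed, from negative to positive.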

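1. Log of Charges
- BMI is no longer significant (p = 0.335)
- Smoker remains highly significant (t falls slightly, from 7.68 to 7.45)
- Age is now much more significant (t rises from 3.73 to 7.69)

2. Log of Charges and Log of Age
- BMI is no longer significant (p = 0.225)
- Smoker remains highly significant (t = 7.45)
- Log of Age is highly significant (t = 7.69)

3. Log of Charges and Age Squared
- BMI (p = 0.315), Age (p = 0.232) and Age Squared (p = 0.725) are not significant
- Smoker remains highly significant (t = 7.36)

4. Log of Charges and Log of BMI
- Log of BMI is not significant (p = 0.348)
- Age is much more significant (t = 7.66)
- Smoker remains highly significant (t = 7.44)

5. Log of Charges and BMI squared
- Neither BMI (p = 0.947) nor BMI Squared (p = 0.953) is significant
- Age is much more significant (t = 7.54)
- Smoker remains highly significant (t = 7.35)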

Question 5

Use the 3 withheld observations and calculate the performance measures for your best two models. Which is the better model? (remember that “better” depends on whether your outlook is short or long run)

library(magrittr)  # supplies the %>% pipe (also loaded by dplyr/tidyverse)

# Add the transformed variables that the models expect
insurance.test$LogCharges <- log(insurance.test$Charges)
insurance.test$BMISq <- insurance.test$BMI^2
insurance.test$AgeSq <- insurance.test$Age^2

# Original model: predictions are already in dollars
insurance.test$bad_model_pred <- predict(model, newdata = insurance.test)

# Log models: convert predictions back to dollars with exp()
insurance.test$model_1_pred <- predict(model_LogChrgBMISq, newdata = insurance.test) %>% exp()
insurance.test$model_2_pred <- predict(model_LogChrgAgeSq, newdata = insurance.test) %>% exp()

# Finding the error (prediction minus actual charges)
insurance.test$error_bm <- insurance.test$bad_model_pred - insurance.test$Charges
insurance.test$error_1 <- insurance.test$model_1_pred - insurance.test$Charges
insurance.test$error_2 <- insurance.test$model_2_pred - insurance.test$Charges
Bias for the Bad Model, Model 1, & Model 2
## [1] 2096.91
## [1] 240.616
## [1] 356.8711
MAE for the Bad Model, Model 1, & Model 2
## [1] 5282.157
## [1] 412.3407
## [1] 512.8377
RMSE for the Bad Model, Model 1, & Model 2
## [1] 6720.431
## [1] 429.0247
## [1] 584.066
MAPE for Bad Model, Model 1, & Model 2
## [1] 0.6206971
## [1] 0.07086708
## [1] 0.07259645
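
The four measures above can be computed from an error column in a few lines. The sketch below assumes bias is the mean error, uses the same "prediction minus actual" error as the code above, and reports MAPE as a fraction; the helper name `forecast_metrics` and the three example values are hypothetical and are not the withheld observations.

```r
# Bias, MAE, RMSE and MAPE for a set of predictions
forecast_metrics <- function(actual, predicted) {
  err <- predicted - actual                  # same sign convention as the report
  c(bias = mean(err),                        # average signed error
    MAE  = mean(abs(err)),                   # average absolute error
    RMSE = sqrt(mean(err^2)),                # penalizes large misses more heavily
    MAPE = mean(abs(err) / actual))          # average absolute error as a fraction
}

# Hypothetical example values (dollars)
actual    <- c(1000, 2000, 4000)
predicted <- c(1100, 1800, 4400)

m <- forecast_metrics(actual, predicted)
print(round(m, 3))
```

Applied to this report, the function would be called once per model, for example with `insurance.test$Charges` and `insurance.test$model_1_pred`.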

The initial model performed the worst on every measure. Model 1 (Log of Charges with BMI Squared) has lower bias, MAE, RMSE, and MAPE than Model 2 (Log of Charges with Age Squared). Because Model 1 also has the lower RMSE (429 versus 584), it made no large individual prediction mistakes relative to Model 2. The choice of model normally depends on your preferred time frame, but here Model 1 wins on all four measures, so it is the better model whether you are considering the near future or the long term. With only three withheld observations, this ranking is tentative.

Question 6

Provide interpretations of the coefficients, do the signs make sense? Perform marginal change analysis (thing 2) on the independent variables.

Summary model for Log of Charges and BMI Squared
## 
## Call:
## lm(formula = LogCharges ~ ., data = insurance.new[, c(2:10, 14)])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.65467 -0.14654 -0.04853  0.03424  1.28639 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    7.111e+00  1.356e+00   5.245 6.61e-06 ***
## Age            3.502e-02  4.643e-03   7.543 5.43e-09 ***
## BMI            6.116e-03  9.201e-02   0.066    0.947    
## Female         5.393e-02  1.280e-01   0.422    0.676    
## Children       6.287e-02  5.354e-02   1.174    0.248    
## Smoker         1.324e+00  1.802e-01   7.349 9.77e-09 ***
## WinterSprings -8.169e-03  1.844e-01  -0.044    0.965    
## WinterPark    -5.396e-02  1.837e-01  -0.294    0.771    
## Oviedo        -1.452e-01  1.542e-01  -0.941    0.353    
## BMISq          9.296e-05  1.562e-03   0.060    0.953    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4025 on 37 degrees of freedom
## Multiple R-squared:  0.7924, Adjusted R-squared:  0.7419 
## F-statistic: 15.69 on 9 and 37 DF,  p-value: 3.779e-10

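The intercept is positive. Charges increase with age: the Age coefficient of 0.035 means each additional year of age raises predicted charges by roughly 3.5%, holding the other variables constant. Charges also increase with BMI, although neither BMI term is statistically significant. Smokers have much higher charges (coefficient 1.324). Charges are slightly higher for female clients, which makes sense given pregnancy-related charges. All three city coefficients are negative, so charges are lower in Winter Springs, Winter Park, and Oviedo than in the default city, Sanford. All of these signs make sense.

Of all the SEE x 2 tests, Children appears to show the most room for error.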
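
A sketch of the marginal change calculation, reading the "SEE x 2" test as the coefficient plus or minus two standard errors (that reading is an assumption). Estimates and standard errors are copied from the regression output above. Because the dependent variable is the log of Charges, each bound is converted to a percent change with `exp(b) - 1`. BMI is left out because, with a squared term in the model, its marginal effect depends on the BMI level.

```r
# Estimates and standard errors copied from the Log Charges / BMI Squared output
coefs <- data.frame(
  term     = c("Age", "Female", "Children", "Smoker"),
  estimate = c(3.502e-02, 5.393e-02, 6.287e-02, 1.324),
  se       = c(4.643e-03, 1.280e-01, 5.354e-02, 1.802e-01)
)

# Coefficient plus or minus two standard errors (log scale)
coefs$lower <- coefs$estimate - 2 * coefs$se
coefs$upper <- coefs$estimate + 2 * coefs$se

# Percent change in Charges for a one-unit increase in each variable
coefs$pct_change <- 100 * (exp(coefs$estimate) - 1)
coefs$pct_lower  <- 100 * (exp(coefs$lower) - 1)
coefs$pct_upper  <- 100 * (exp(coefs$upper) - 1)

# A range that spans zero means even the sign of the effect is uncertain
coefs$spans_zero <- coefs$lower < 0 & coefs$upper > 0

print(coefs[, c("term", "pct_lower", "pct_change", "pct_upper", "spans_zero")],
      digits = 3)
```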

Question 7

An eager insurance representative comes back with five potential clients. Using the better of the two models selected above, provide the prediction intervals for the five potential clients using the information provided by the insurance rep.

Customer  Age  BMI  Female  Children  Smoker  City
1         60   22   1       0         0       Oviedo
2         40   30   0       1         0       Sanford
3         25   25   0       0         1       Winter Park
4         33   35   1       2         0       Winter Springs
5         45   27   1       3         0       Oviedo
## 
## Call:
## lm(formula = LogCharges ~ ., data = insurance.new[, c(2:10, 14)])
## 
## Coefficients:
##   (Intercept)            Age            BMI         Female       Children  
##     7.111e+00      3.502e-02      6.116e-03      5.393e-02      6.287e-02  
##        Smoker  WinterSprings     WinterPark         Oviedo          BMISq  
##     1.324e+00     -8.169e-03     -5.396e-02     -1.452e-01      9.296e-05
##         fit      lwr      upr
## 1 10940.686 4345.449 27545.74
## 2  6915.164 2941.044 16259.36
## 3 12933.267 4787.337 34939.97
## 4  6410.797 2541.912 16168.27
## 5  8240.879 3380.672 20088.34
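
As a cross-check, the point predictions can be rebuilt by hand from the printed coefficients. Intervals like those above would typically come from `predict()` with `interval = "prediction"` followed by `exp()`, but that is an assumption about how this output was produced. The sketch below uses only the rounded coefficients and the reported intervals, so small differences from the `fit` column are rounding.

```r
# Coefficients as printed above (rounded to four significant figures)
b <- c(Intercept = 7.111, Age = 3.502e-02, BMI = 6.116e-03, Female = 5.393e-02,
       Children = 6.287e-02, Smoker = 1.324, WinterSprings = -8.169e-03,
       WinterPark = -5.396e-02, Oviedo = -1.452e-01, BMISq = 9.296e-05)

# The five potential clients, with city recoded as dummies (Sanford = all zeros)
clients <- data.frame(
  Age           = c(60, 40, 25, 33, 45),
  BMI           = c(22, 30, 25, 35, 27),
  Female        = c(1, 0, 0, 1, 1),
  Children      = c(0, 1, 0, 2, 3),
  Smoker        = c(0, 0, 1, 0, 0),
  WinterSprings = c(0, 0, 0, 1, 0),
  WinterPark    = c(0, 0, 1, 0, 0),
  Oviedo        = c(1, 0, 0, 0, 1)
)
clients$BMISq <- clients$BMI^2

# Linear predictor on the log scale, then back to dollars
X          <- cbind(Intercept = 1, as.matrix(clients[, names(b)[-1]]))
point_pred <- exp(as.vector(X %*% b))

# Intervals exactly as reported above
reported <- data.frame(
  fit = c(10940.686, 6915.164, 12933.267, 6410.797, 8240.879),
  lwr = c(4345.449, 2941.044, 4787.337, 2541.912, 3380.672),
  upr = c(27545.74, 16259.36, 34939.97, 16168.27, 20088.34)
)
reported$width <- reported$upr - reported$lwr   # width in dollars
reported$ratio <- reported$upr / reported$lwr   # width on the log scale

print(round(cbind(recomputed = point_pred, reported), 2))
```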

Question 8

The owner notices that some of the predictions are wider than others, explain why.

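The largest range for the group of customers belongs to customer #3, a 25-year-old male smoker with no children living in Winter Park, whose interval runs from about $4,787 to $34,940. A prediction interval widens as a client's characteristics move away from the sample averages, and customer #3 is near the youngest age in the sample (the minimum is 23 and the mean is 42). Because the intervals are built on the log scale and converted back to dollars, a higher predicted charge also produces a wider interval in dollars, and customer #3 has the highest predicted charge of the five.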

Question 9

Are there any prediction problems that occur with the five potential clients? If so, explain. (Verbal response; reference Questions 1 & 2.)